CPU

Where I Stopped

Physical Structure

Manufacturing

Cache

  • Modern CPUs fetch memory in 64-byte cache lines and prefetch adjacent memory.
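
  • A minimal sketch of why this matters, in Odin (the array and stride are hypothetical): sequential access reuses every 64-byte line it fetches, while a 64-byte stride touches a fresh line per access and wastes most of each fetch.

    package main

    import "core:fmt"

    main :: proc() {
        data: [4096]f32

        // Sequential: 16 f32 values share each 64-byte cache line,
        // and the prefetcher streams the next lines ahead of the loop.
        sum_seq: f32
        for v in data do sum_seq += v

        // Stride of 16 f32 = 64 bytes: every access lands on a new
        // cache line, using only 4 of the 64 bytes fetched.
        sum_strided: f32
        for i := 0; i < len(data); i += 16 do sum_strided += data[i]

        fmt.println(sum_seq, sum_strided)
    }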

Performance: Fixed Array (Small Array) vs Dynamic Array

Discussion
  • Considering the struct IK_Chain :

    IK_Chain :: struct {
        joints: [dynamic]^Joint,
        bone_lengths: [dynamic]f32,
        target: eng.Transform_Node,
        is_target_moving: bool,
        placement: IK_Placement,
    }
    
  • A question about performance: I'm storing bone_lengths: [dynamic]f32 as a cache inside an IK_Chain. I opted for a [dynamic] array because I don't know in advance how long a chain will be. Though, now that I know about Small_Array, I question whether I should make this bone_lengths: sarray.Small_Array(SOME_REASONABLE_NUMBER, f32) for cache locality. It seems like a trade-off between memory and speed: by using a Small_Array I'd overestimate the size of the array so that it fits all the major cases of IK_Chain I would build. For context, an IK_Chain currently has only 4~5 joints, but it could have 20+ for some creatures, and I update the IK_Chain every frame for visuals. So, would a small array with a ~20 cap be a good trade-off over a dynamic array? Probably also relevant: using this for bone_lengths would imply also using it for joints: [dynamic]^Joint in the IK_Chain; consider that the ^Joint targets live on the stack.

  • TLDR :

    • Small_Array wins because the entire struct plus bone data often fits in 1-2 cache lines (sketched below)
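
    • A sketch of that layout, assuming Odin's core:container/small_array and a hypothetical MAX_CHAIN_JOINTS cap of 20:

      import sarray "core:container/small_array"

      MAX_CHAIN_JOINTS :: 20 // hypothetical cap covering the largest expected creature

      IK_Chain :: struct {
          joints:           sarray.Small_Array(MAX_CHAIN_JOINTS, ^Joint),
          bone_lengths:     sarray.Small_Array(MAX_CHAIN_JOINTS, f32),
          target:           eng.Transform_Node,
          is_target_moving: bool,
          placement:        IK_Placement,
      }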

Location and continuity
  • Dynamic Array :

    • A [dynamic]f32  stores its metadata (pointer, length, capacity) in the struct, but the actual data is heap-allocated.

    • When you do make([dynamic]f32, 0, capacity) , Odin allocates a contiguous block of memory on the heap. All elements are stored sequentially in this block:

    • [elem0][elem1][elem2]...

    • As long as you don't exceed this capacity, the values stay in one fixed, contiguous region (small sketch below).

    • Accessing the elements still costs an indirection (pointer chasing): the struct holds only the pointer, the data lives elsewhere.
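
    • A small sketch of the capacity point: the backing address (raw_data) stays stable until capacity is exceeded.

      lengths := make([dynamic]f32, 0, 8)
      defer delete(lengths)

      p0 := raw_data(lengths)      // address of the heap block
      append(&lengths, 1.0, 2.0)   // within capacity: no reallocation
      assert(raw_data(lengths) == p0)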

  • Small_Array :

    • Is embedded directly in the struct (whether the struct is on stack/heap depends on context).

    • Elements are also contiguous, but embedded within the parent struct.
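
    • Typical usage, sketched with proc names as I recall them from core:container/small_array:

      bones: sarray.Small_Array(20, f32)
      sarray.push_back(&bones, 1.5)
      sarray.push_back(&bones, 2.0)
      first := sarray.get(bones, 0)   // 1.5
      view  := sarray.slice(&bones)   // []f32 over the embedded elements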

  • Example :

    • Ex1: Array of structs :

      chains: [100]IK_Chain
      
      • Dynamic Array :

        • Only struct metadata is contiguous. Actual data is fragmented:

          [Chain0] → (heap_ptr0 → bones0)
                   → (heap_ptr1 → joints0)
          [Chain1] → (heap_ptr2 → bones1)
                   → (heap_ptr3 → joints1)
          
      • Small_Array :

        • All data (struct fields + bone lengths + joints) is in one contiguous memory block.

          [Chain0][bones0][joints0][Chain1][bones1][joints1]...
          
    • Ex2: Single struct :

      • Dynamic Array :

        • Requires at least two separate memory loads:

          • Load IK_Chain  struct (metadata)

          • Load bone data from heap pointer

      • Small_Array :

        • All data loaded in 1-2 cache lines.

  • In both cases the backing array sits at a fixed location in memory (until a reallocation), so both benefit from caching.

  • "What about the joints ?" :

    • The [dynamic]^Joint  can be problematic due to double indirection (pointer to array + pointer to Joint).

      • This is worse than a dynamic array of plain values.
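
    • A sketch of the two dependent loads per element (the position field on Joint is hypothetical):

      for joint in chain.joints {   // load 1: the heap-allocated pointer array
          pos := joint.position     // load 2: the Joint it points at, wherever that lives
          _ = pos
      }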

Dynamic Array reallocation
  • Reallocations are rare if capacity is preset, but when they occur:

    • Invalidates all cached bone data (the block moves to a new address, and old pointers into it dangle).

    • Costs CPU cycles for allocation/memcpy.

  • A Small_Array avoids this entirely.
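
  • If the dynamic array is kept, pre-reserving capacity sidesteps most reallocations; a sketch using the built-in reserve (the count is a hypothetical upper estimate):

    max_joints := 20
    bone_lengths := make([dynamic]f32, 0, max_joints)
    // or, on an already-existing array:
    reserve(&bone_lengths, max_joints)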

Deciding between memory efficiency and performance
  • Considerations :

    • If you only need AI data in one system and skeletons in another → cache-locality benefits diminish.

    • If you allocate for 50 creatures but typically have 10-15 → 70-80% of the memory is wasted.

    • Bad if creature size > cache line (typically 64-128B).

  • DO use Small_Array for :

    • Core metadata (position, health).

    • Hot components (AI state, IK chains).

    • Fixed-size subsystems.

  • AVOID Small_Array for :

    • The entire Creature struct.

    • Large/variable data (animations).

    • The creatures container itself.
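
  • A hypothetical Creature layout following these rules (Animation and the field names are placeholders; sarray as imported in the earlier sketch):

    Creature :: struct {
        position:   [3]f32,                          // core metadata: embedded
        health:     f32,
        ik_chains:  sarray.Small_Array(4, IK_Chain), // hot, bounded subsystem
        animations: [dynamic]Animation,              // large/variable data stays on the heap
    }

    creatures: [dynamic]Creature // the container itself stays dynamic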

SIMD

CPU ARM

NEON

  • Vector width:  128-bit registers

  • Typical lane count:  4 lanes Γ— 32-bit (e.g., 4 Γ— float32 )
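
  • For a feel of the lane model, a 4-lane f32 multiply sketched with Odin's #simd types, which maps to one 128-bit NEON (or SSE) register on supporting hardware:

    import "core:fmt"
    import "core:simd"

    demo :: proc() {
        a := #simd[4]f32{1, 2, 3, 4}
        b := #simd[4]f32{10, 20, 30, 40}
        c := a * b                               // one instruction, 4 lanes at once
        fmt.println(simd.reduce_add_ordered(c))  // horizontal sum: 300
    }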

SVE / SVE2 (Scalable Vector Extension)
  • Vector width:  Variable.

    • Register width is not fixed (128–2048 bits in 128-bit steps).

    • Code is vector-length agnostic (designed to scale across cores).

  • Typical lane count:  hardware-dependent; e.g., 8 lanes × 32-bit (8 × float32 ) on a 256-bit implementation

CPU x86/x64 (Intel / AMD)

FMA3 / FMA4 (Fused multiply-add)
  • Is often used in combination with AVX/AVX2/AVX-512.

  • Platform:  x86/x64 (Intel & AMD)

  • Vector width:  256-bit registers

  • Typical lane count:  8 lanes Γ— 32-bit (8 Γ— float32 )
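
  • A scalar sketch of what gets fused (the compiler may emit a single FMA instruction for this when the target supports it):

    mul_add :: proc(a, b, c: f32) -> f32 {
        // Without FMA: round(round(a*b) + c), two rounding steps.
        // With FMA:    round(a*b + c), one instruction and one rounding step.
        return a*b + c
    }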

AVX-512
  • Adoption is limited (mainly HPC, data centers, or select chips).

  • Includes masking, scatter/gather, and more advanced operations.

  • Platform:  x86/x64 (select Intel CPUs and recent AMD (Zen 4 and later); not universally available)

  • Vector width:  512-bit registers

  • Typical lane count:  16 lanes Γ— 32-bit (16 Γ— float32 )

AVX / AVX2 (Advanced Vector Extensions)
  • AVX2 added full 256-bit integer support

  • Platform:  x86/x64 (Intel & AMD)

  • Vector width:  256-bit registers

  • Typical lane count:  8 lanes Γ— 32-bit (8 Γ— float32 )

SSE (Streaming SIMD Extensions)
  • Superseded by AVX.

  • SSE1–SSE4 progressively added instructions but retained 128-bit width.

  • Platform:  x86/x64 (Intel & AMD)

  • Vector width:  128-bit registers

  • Typical lane count:  4 lanes Γ— 32-bit (4 Γ— float32 )

MMX
  • Legacy, obsolete.

  • Platform:  x86/x64 (Intel & AMD)

  • Vector width:  64-bit registers

  • Typical lane count:  2 lanes × 32-bit (integer only; MMX has no floating-point support)

RISC-V (Reduced Instruction Set Computer) (Risk-Five)

  • Is an open, modular instruction set architecture (ISA) based on the RISC (Reduced Instruction Set Computer) design principles.

  • Unlike proprietary ISAs (e.g., x86 by Intel/AMD, ARM by Arm Ltd.), RISC-V is:

    • Open source:  anyone can use or implement it without licensing fees.

    • Modular  β€” It has a minimal base instruction set, with optional extensions (e.g., floating-point, SIMD, vector).

RVV (RISC-V Vector Extension)
  • Similar to ARM SVE, RVV allows hardware to define vector width.

  • Not fixed to 128, 256, or 512 bits; code adapts dynamically.

  • Scalable width:  Vector registers can be from 128 to 2048 bits, depending on hardware.

  • Vector-Length Agnostic (VLA):

    • Programs don't assume a fixed vector width.

    • Code adapts to the hardware at runtime; the same binary works on 128-bit or 512-bit hardware.

GPU

  • GPUs use SIMT  (Single Instruction, Multiple Thread), not SIMD per se, but functionally similar at scale.

CUDA
  • NVIDIA GPUs

  • Vector width:  Scalar SIMT (each thread operates on scalar values; parallelism comes from the warp)

  • Typical lane count:  32 threads per warp (effectively 32 × float32 )

OpenCL
  • Cross-vendor GPU compute

  • Vector width:  Variable

  • Typical lane count:  device-dependent (e.g., float4 through float16 vector types)

Wavefronts / Warps
  • Used in GPU shaders

  • Vector width:  32/64 threads

  • Typical lane count:  32 threads (NVIDIA warps) or 32/64 threads (AMD wavefronts), each on 32-bit values